Datasets, Corpora and other Language Resources
Abstract
This chapter provides an overview of what is available in ELG in terms of datasets, corpora and other language resources (LRs) and how this has been achieved. We look at the procedures and steps that have been followed to complete the full resource ingestion cycle, which goes from repository and LR identification to metadata description and ingestion. We explain the approaches, priorities and methodology. The chapter also outlines the repositories integrated into ELG, discussing the different approaches (metadata conversion, extraction, completion, as well as harvesting) and the reasons behind these choices. Furthermore, the catalogue content is described, with details on key elements, features and accomplishments. The last two sections are devoted to the crucial legal issues of such a complex platform and to its data management plan, respectively.
Similar resources
Word Alignment for Languages with Scarce Resources Using Bilingual Corpora of Other Language Pairs
This paper proposes an approach to improve word alignment for languages with scarce resources using bilingual corpora of other language pairs. To perform word alignment between languages L1 and L2, we introduce a third language L3. Although only small amounts of bilingual data are available for the desired language pair L1-L2, large-scale bilingual corpora in L1-L3 and L2-L3 are available. Base...
ExATOlp: extraction of language resources from Portuguese corpora
This paper presents four main features of the ExATOlp software tool. These features provide the following language resources: corpus relevant terms and their morpho-syntactic and frequency features; concordancer (terms contexts); concept tags; and concept hierarchies. The emphasis of the tool relies on the high quality of extracted terms. The provided resources offer a concise representation of...
Overcoming the Sparseness Problem of Spoken Language Corpora Using Other Large Corpora of Distinct Characteristics
This paper proposes a method of combining two n-gram language models, one constructed from a very small corpus of the right domain of interest, the other constructed from a large but less adequate corpus, resulting in a significantly enhanced language model. This method is based on the observation that a small corpus from the right domain has high quality n-grams but has serious sparseness prob...
A Web-Platform for Preserving, Exploring, Visualising, and Querying Linguistic Corpora and other Resources
We present SPLICR, the Web-based Sustainability Platform for Linguistic Corpora and Resources. The system is aimed at people who work in Linguistics or Computational Linguistics: a comprehensive database of metadata records can be explored in order to find language resources that could be appropriate for one’s specific research needs. SPLICR also provides a graphical interface that enables user...
Journal
Journal title: Cognitive technologies
Year: 2022
ISSN: 2197-6635, 1611-2482
DOI: https://doi.org/10.1007/978-3-031-17258-8_8